The Clean and the Green

Investigating the Relationship Between Income and Sanitation

Author

Giovani Thai, William Gladden, Soren Paetau

Code
library(tidyverse)
library(ggridges)
library(RColorBrewer)
library(broom)
library(gganimate)
library(gifski)

1 Project Proposal

1.1 Data Description

Code
sanitation.data <- read_csv("at_least_basic_sanitation_overall_access_percent.csv")
income.data <- read_csv("income_per_person_gdppercapita_ppp_inflation_adjusted.csv")
colnames(income.data) <- c("country", as.numeric(1799:2049))

1.1.1 Income Dataset

Code
head(select(income.data, 1:5))
# A tibble: 6 × 5
  country              `1799` `1800` `1801` `1802`
  <chr>                 <dbl>  <dbl>  <dbl>  <dbl>
1 Afghanistan             683    683    683    683
2 Angola                  700    702    705    709
3 Albania                 755    755    755    755
4 Andorra                1360   1360   1360   1360
5 United Arab Emirates   1130   1130   1140   1140
6 Argentina              1730   1730   1740   1740

The income data set measures the total GDP of a country per person for 216 countries between the years of 1892 and 2049, with units in fixed 2017 prices. This number is adjusted for PPP, or the differences between costs of living and essential products, across countries. The data comes from the World Bank, the Maddison Project Database, and the Penn World Table. Historical estimates were used for early years; forecasts from the IMF’s Economic Outlook were used to project income in the future.

Source: https://www.gapminder.org/data/documentation/gd001/

1.1.2 Sanitation Dataset

Code
head(select(sanitation.data, 1:5))
# A tibble: 6 × 5
  country              `1999` `2000` `2001` `2002`
  <chr>                 <dbl>  <dbl>  <dbl>  <dbl>
1 Aruba                  98.3   98.3   98.2   98.1
2 Afghanistan            21.9   21.9   23.3   24.7
3 Angola                 27.6   29     30.4   31.9
4 Albania                89.5   90     90.6   91.2
5 Andorra               100    100    100    100  
6 United Arab Emirates   NA     NA     97.4   97.4

The sanitation data set measures the percentage of people (living in both urban and rural settings) who use at lease basic sanitation services not shared with other households. This includes:

  • flushing/pouring to piped sewer systems
  • septic
  • ventilation for pit latrines (e.g. squat toilets)

The data was collected by the World Health Organization and UNICEF, falling between the years of 1999 to 2019.

Source: https://data.worldbank.org/indicator/SH.STA.SMSS.ZS

1.2 Hypothesized Relationship

We hypothesize that there will be a positive relation between income per person and percentage of people with basic sanitation at their disposal. We would also like to investigate and highlight any potential outliers, such as moments in time where certain countries failed to improve sanitation with increased income, or drastic dips/surges in income/sanitation over a single year.

1.3 Data Cleaning Process

Code
convert_num <- function(x){ #converts "12.3k" to 12300,  
  temp <- x
  if(str_detect(x, "k$")){
    temp <- x |> 
      str_replace_all("[^[:digit:]]", "") |> #gross but it just replaces any non-number with ""
      as.numeric() * 100
  }
  return(temp)
}

#data cleaning + joining
sanitation.clean <- sanitation.data |>
  drop_na() |>
  pivot_longer(cols = `1999`:`2019`,
               names_to = "year",
               values_to = "sanitation")

income.clean <- income.data |>
  drop_na() |> #pretty bold step to remove any rows with na vals, 
               #but due to the abundance of data seems reasonable
  select(c(country, `1999`:`2019`)) |>
  pivot_longer(cols = `1999`:`2019`,
               names_to = "year",
               values_to = "income") |>
  mutate(income = as.numeric(map(income, convert_num)))#converts characters,
  #able to convert after pivot, b/c data is all seen as strings!

The cleaning process above is relatively simple, converting some data entries of the form \(12.3k \to 12300\). Notice that the decision to drop NA values was made, since ggplot and lm will drop NA values and we have an abundance of data entries. This decision will save headaches later down the road when taking advantage of the country column.

Code
full.data <- inner_join(sanitation.clean, income.clean)
head(full.data)
# A tibble: 6 × 4
  country     year  sanitation income
  <chr>       <chr>      <dbl>  <dbl>
1 Afghanistan 1999        21.9    584
2 Afghanistan 2000        21.9    569
3 Afghanistan 2001        23.3   1190
4 Afghanistan 2002        24.7   1240
5 Afghanistan 2003        26.1   1200
6 Afghanistan 2004        27.5   1290

2 Linear Regression

We aim to piece together the relationship between income and sanitation levels.

2.1 Data Visualization

Our goal is to develop a model that predicts the percent of people with basic sanitation (response) from adjusted income per person (explanatory).

2.1.1 Macro-Scale Relation Between Income, Sanitation

We first visualize the relation between these two variables:

Code
full.plot <- full.data |>
  ggplot(mapping = aes(x = income, y = sanitation) ) +
  geom_point() +
  geom_smooth(method = lm, level = 0, color = "red") +
  scale_y_continuous(limits = c(0,100),
                     breaks = seq(0,100,25)) +
  labs(title = "Income versus Sanitation, 1999-2049",
       subtitle = "Percentage of People with (at least) Basic Sanitation Services",
       x = "Adjusted Income per Person (in 2017 $)",
       y = "")
full.plot

Code
temp.plot <- full.data |>
  mutate(year = as.numeric(year)) |>
  ggplot(mapping = aes(x = income,
                       y = sanitation)) +
  geom_point(show.legend = F) +
  transition_time(year) +
  labs(title = "Income versus Sanitation, 1999-2019",
       subtitle = "Year: {floor(frame_time)}",
       x = "Adjusted Income per Person (in 2017 $)",
       y = "Percentage of People with (at least) Basic Sanitation Services")
  
temp.plot

Although the above scatter plot is very messy, we can see a general pattern: countries with low adjusted income per person have lower percentages of people with basic sanitation, whereas countries with high income per person never have less than 90 percent of people with basic sanitation.

However, there are instances where countries with very low income per person have high percentages of people with basic sanitation. Below you can see that the United States has had two years between 1999 and 2019 where adjusted income per person dipped below $20,000, but over 99% of Americans were able to have access to basic sanitation.

Code
full.data |>
  filter(country == "United States") |>
  ggplot(mapping = aes(x = income, y = sanitation)) +
  geom_point() +
  labs(title = "United States Income vs. Sanitation, 1999-2019",
       subtitle = "Percentage of People with (at least) Basic Santitation",
       x = "Adjusted Income Per Person (in 2017 $)",
       y = "")

Despite an obvious general trend, further investigation is needed to gain insight on descrepancies like the ones above.

2.1.2 Relationship over Time:

We look at the relationship between average income per person and average percentage of people with basic sanitation at their disposal across 1999-2019.

Code
# take average income and sanitation for each year, then plot?
# cleveland dot plots to better indicate year
# 1999 is lowest, 2019 is highest, everything else already in order.
full.data |>
  group_by(year) |>
  summarise(mean_income = mean(income),
            mean_sanitation = mean(sanitation)) |>
  ggplot(mapping = aes(x = mean_income,
                       y = mean_sanitation#,
                       #color = fct_reorder2(year, mean_income, mean_sanitation)
                       )
         ) +
  geom_segment(mapping = aes(xend = 0, yend = mean_sanitation)) +
  geom_point() +
  annotate("text", x = 17000, y = 76.6, label = "2019") +
  annotate("text", x = 18700, y = 76.1, label = "2018") +
  annotate("text", x = 17600, y = 75.6, label = "2017") +
  annotate("text", x = 19200, y = 75.2, label = "2016") +
  annotate("text", x = 19200, y = 74.6, label = "2015") +
  annotate("text", x = 17500, y = 74.1, label = "2014") +
  annotate("text", x = 17000, y = 73.5, label = "2013") +
  annotate("text", x = 17700, y = 73.0, label = "2012") +
  annotate("text", x = 16900, y = 72.4, label = "2011") +
  annotate("text", x = 17500, y = 71.9, label = "2010") +
  annotate("text", x = 16100, y = 71.3, label = "2009") +
  annotate("text", x = 16600, y = 70.7, label = "2008") +
  annotate("text", x = 17200, y = 70.1, label = "2007") +
  annotate("text", x = 17200, y = 69.5, label = "2006") +
  annotate("text", x = 17200, y = 69, label = "2005") +
  annotate("text", x = 15900, y = 68.5, label = "2004") +
  annotate("text", x = 15200, y = 67.9, label = "2003") +
  annotate("text", x = 14700, y = 67.4, label = "2002") +
  annotate("text", x = 15000, y = 66.9, label = "2001") +
  annotate("text", x = 14400, y = 66.4, label = "2000") +
  annotate("text", x = 14600, y = 65.8, label = "1999") +
  labs(title = "Average Income versus Average Sanitation, 1999-2019",
       subtitle = "Average Percentage of People with Basic Sanitation",
       x = "Average Income per Person (in 2017 $)",
       y = "",
       color = "Year")

Plotting the aggregate average income per person and average sanitation percentage across all countries, the trend seems to follow a increasing trend. Again, further investigation (and possibly historical analysis) is required in otder to decipher the dips in cleanliness around time frames such as the late aughts and the early 2010’s.

2.2 Linear regression

Upon fitting a simple linear regression model for percent of people who have access to basic sanitation (\(Y\)) based on adjusted income per person (\(X\)), we acquired the following regression equation: \[\hat{Y} = 55.75 + 0.0009699X\]

Our regression equation tells us that a country with an average adjusted income of $0 will have 55.75% of its people with access to basic sanitation services. Further, for every ten thousand dollars increase in adjusted income per person, a country should expect a 9.699 percent increase in the amount of people who have access to basic sanitation.

2.3 Model Fit

To assess the validity of our linear model, we look at the variances in observed sanitation percentage, predicted sanitation percentage, and residuals:

Code
model_var <- augment(data.fit) |>
  summarise(var.resp = var(sanitation),
         var.fitted = var(.fitted),
         var.resid =var(.resid))

model_var
# A tibble: 1 × 3
  var.resp var.fitted var.resid
     <dbl>      <dbl>     <dbl>
1     943.       314.      629.

The proportion of variability accounted for by our model was 0.333. Our model does a good enough job to communicate the general idea that more money means people are going to be cleaner. However, the variable we are hoping to predict is a percentage which means it can not exceed 100%, exposing the flaw that our model will predict impossible to reach sanitation percentages for high enough average income inputs. This gives us problems regarding what we can and can not conclude due to risk of extrapolating further than our data provides. There is not much we can do about this using a simple linear model as we would have to introduce a polynomial regression equation (say, through a logarithmic transformation) to be more precise.